[Figure: (a) Importance of spatial understanding. (b) Importance of temporal understanding. Ground-truth labels shown: "Put down a cheese", "Take out a sauce". Axis label: top-1 accuracy.]
Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics. The code is available at https://github.com/KHU-VLL/CAST.
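To make the idea of two experts exchanging information concrete, the following is a minimal sketch of generic scaled dot-product cross-attention between two token streams. It is not the CAST implementation: the function names are hypothetical, the query/key/value projections are assumed to be identity maps, and the bottleneck design described above is omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Update each query token with information attended from `context`.

    queries: list of token vectors from one stream (e.g. the spatial expert).
    context: list of token vectors from the other stream (e.g. the temporal expert).
    Assumption: identity projections, single head, residual connection.
    """
    d = len(queries[0])
    updated = []
    for q in queries:
        # Scaled dot-product similarity between this query and every context token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        # Weighted sum of context tokens (values are the context tokens themselves).
        attended = [sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)]
        # Residual connection keeps the original stream's features.
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


# Hypothetical toy tokens: 2 spatial tokens and 3 temporal tokens of dimension 2.
spatial_tokens = [[1.0, 0.0], [0.0, 1.0]]
temporal_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

# Each stream queries the other, so information flows in both directions.
spatial_updated = cross_attention(spatial_tokens, temporal_tokens)
temporal_updated = cross_attention(temporal_tokens, spatial_tokens)
```

In this sketch each stream keeps its own token count and only borrows features from the other stream, which is the sense in which the two experts can inform each other's predictions.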
A Additional results and experiment details
A.1 Detailed results on ImageNet-C
In Table 4, we list the mCE of each corruption category.

We apply our method to other network architectures and evaluate on the task of image classification. Datasets in Table 6 are the same as in Table 1.

Intuitively, when testing on data whose distribution is "close" to the training data, using the main

In this work, we take a naive measurement for the "closeness" of an ImageNet

We process the entire ImageNet validation set using the visualization technique introduced in Section 3. We do not shuffle the ImageNet validation data when generating these batches.

Table 8 shows the classification performance of various models on the two ImageNet-AdvBN variants, denoted as IN-Adv-VGG and IN-Adv-ResNet respectively.
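For the per-category corruption errors on ImageNet-C (Table 4), the standard benchmark definition normalizes a model's error, summed over the severity levels of a corruption, by a baseline model's error on the same corruption, and mCE is the mean over corruption categories. The sketch below assumes that standard definition; the function names and all numbers are hypothetical and are not results from this paper.

```python
def corruption_error(model_errors, baseline_errors):
    """CE for one corruption: errors summed over severities, normalized by the baseline."""
    return sum(model_errors) / sum(baseline_errors)


def mean_corruption_error(model, baseline):
    """Return (per-category CE in percent, mCE in percent).

    model, baseline: dicts mapping corruption name -> list of top-1 error
    rates (one entry per severity level, values in [0, 1]).
    """
    per_category = {
        name: 100.0 * corruption_error(model[name], baseline[name])
        for name in model
    }
    mce = sum(per_category.values()) / len(per_category)
    return per_category, mce


# Hypothetical error rates for two corruption categories at five severities.
model_err = {"gaussian_noise": [0.5] * 5, "fog": [0.2] * 5}
baseline_err = {"gaussian_noise": [0.8] * 5, "fog": [0.4] * 5}

per_category_ce, mce = mean_corruption_error(model_err, baseline_err)
```

A value below 100 means the model is more robust than the baseline on that corruption; the full benchmark averages over all of its corruption categories rather than the two used here.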
We respond to the concerns point by point below. Why does distilling prioritized paths improve architecture rating? More sufficient training of subnets leads to a more accurate architecture rating [6] (Sec. 4.3). Which set is used to train the matching network? We will revise the manuscript to make this point clearer.